Red Wine Quality Exploration by Leo Silva

Univariate Plots Section

## [1] 1599
## 'data.frame':    1599 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality.factor      : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality      quality.factor
##  Min.   : 8.40   Min.   :3.000   3: 10         
##  1st Qu.: 9.50   1st Qu.:5.000   4: 53         
##  Median :10.20   Median :6.000   5:681         
##  Mean   :10.42   Mean   :5.636   6:638         
##  3rd Qu.:11.10   3rd Qu.:6.000   7:199         
##  Max.   :14.90   Max.   :8.000   8: 18

I’ll start by plotting a histogram of the quality variable to check how it’s distributed.

Now that I have the histogram above, I’ll plot histograms of the other variables in the dataset to check how each one is distributed. Maybe some have distributions that look like the one above? Let’s check.

fixed.acidity: Normal distribution. 20 outliers have been discarded.

volatile.acidity: Depending on the binwidth used, this one can look like a normal distribution, but the histogram is clearly bimodal. 21 outliers have been discarded.

citric.acid: Lots of values are equal to zero with another peak at around 0.5. Looks like a plateau distribution with some peaks at round numbers.

residual.sugar: Right skewed distribution. 21 outliers > 8 have been discarded.

chlorides: Normal distribution with some outliers to the right. 41 outliers have been discarded in this plot.

free.sulfur.dioxide: Right skewed distribution. 4 outliers > 50 have been discarded.

total.sulfur.dioxide: Right skewed distribution. 2 outliers > 200 have been discarded.

density: Normal distribution.

pH: Normal distribution. 7 outliers have been discarded.

sulphates: Right skewed distribution. 8 outliers > 1.5 have been discarded.

alcohol: Right skewed distribution. 2 outliers have been discarded.

Looking only at the Univariate Plots above we cannot say which variables had more influence over quality. We’ll see more about that in the Bivariate Plots Section.
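The report does not show how the discarded outliers were counted, so the following is only a minimal sketch of one way to do it, using residual.sugar and the cutoff of 8 mentioned above. The vector holds made-up toy values, not the real column, and the names `cutoff`, `kept` and `n.discarded` are hypothetical.

```r
# Toy residual.sugar values (made up); the real column has 1599 entries.
residual.sugar <- c(1.9, 2.6, 2.3, 1.8, 2.0, 6.1, 2.2, 8.3, 15.5, 2.4)

# Assumed cutoff: values above it are left out of the plot only,
# they are not removed from the dataset.
cutoff <- 8

is.outlier  <- residual.sugar > cutoff
n.discarded <- sum(is.outlier)            # how many values the plot drops
kept        <- residual.sugar[!is.outlier]

# Bin the remaining values with a fixed binwidth of 0.5.
# plot = FALSE only computes the bins; drop it to draw the histogram.
h <- hist(kept, breaks = seq(0, cutoff, by = 0.5), plot = FALSE)

n.discarded
h$counts
```

Changing the `by` argument is a quick way to see how the binwidth affects the apparent shape, which matters for variables like volatile.acidity.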

Univariate Analysis

Introduction

What is the structure of your dataset?

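This dataset has 1599 entries for red variants of the Portuguese “Vinho Verde” wine, each described by the 12 variables below (the data frame also holds the row index X and the derived quality.factor, which gives the 14 columns shown by str):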

1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)

Output variable (based on sensory data):
12 - quality (score between 0 and 10)

The quality variable is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent); in this dataset the scores range from 3 to 8. So quality is a qualitative (ordinal categorical) variable, even though it is stored as an integer. The other variables are the results of objective tests (e.g. pH values).

What is the main feature of interest in your dataset?

The main feature in this dataset is the quality of the wine, and I’m particularly interested in finding which variable(s) influence the quality of those wines the most.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The answer to that will come in the Bivariate Plots Section. Anything I say based only on the Univariate Plots would be mere speculation.

Did you create any new variables from existing variables in the dataset?

Yes, I created the variable quality.factor, which is the quality variable cast to a factor. That may help if I want to draw box plots with the quality variable on the x axis.
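As a minimal sketch, the new variable could be created as below. The data frame name `wine` and the scores in it are made-up stand-ins; only the column names quality and quality.factor come from the output shown earlier.

```r
# Toy stand-in for the wine data frame; the scores below are made up.
wine <- data.frame(quality = c(5L, 5L, 6L, 7L, 3L, 8L, 4L, 6L))

# Cast the integer score to a factor so that plotting functions treat
# each score as its own group instead of as a continuous number.
wine$quality.factor <- factor(wine$quality)

levels(wine$quality.factor)   # one level per distinct score, in order
table(wine$quality.factor)    # number of wines per score
```

With the factor on the x axis, a box plot gets one box per quality score rather than a single box over a numeric axis.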

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I didn’t see any unusual distributions, but some variables, such as residual.sugar and chlorides, have histograms with very high peaks. Other variables, such as fixed.acidity, have distributions that look like the quality histogram.

Bivariate Plots Section

The objective here is to see how each of the variables in the dataset relates to quality. I’ll begin by plotting a correlation matrix using the corrplot library and checking which of the variables are most likely to be related to quality.

If you look at how quality correlates with the other variables, you will notice that alcohol and volatile.acidity have the highest correlations, the former positive and the latter negative.
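The numbers behind such a matrix can be read off directly with base R. The sketch below is an illustration on a made-up toy data frame (the values are not from the dataset, and only three of the eleven input columns are included); it ranks the variables by the absolute value of their Pearson correlation with quality.

```r
# Toy data (made up) with a few of the report's column names.
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  volatile.acidity = c(0.70, 0.88, 0.60, 0.50, 0.40, 0.30),
  residual.sugar   = c(1.9, 2.6, 2.3, 1.9, 2.2, 2.0),
  quality          = c(5, 5, 6, 6, 7, 8)
)

# Full Pearson correlation matrix of all numeric columns.
corr.matrix <- cor(wine)

# Keep only the column for quality and drop its correlation with itself.
cors <- corr.matrix[, "quality"]
cors <- cors[names(cors) != "quality"]

# Strongest relationships first, regardless of sign.
cors[order(abs(cors), decreasing = TRUE)]
```

The same `corr.matrix` is the kind of input that `corrplot::corrplot()` draws, so the ranking and the plot stay consistent.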

I’ll go a bit further and will not restrict the analysis to alcohol and volatile.acidity. Below I draw scatter plots with regression lines fitted by a linear model and, to complement them, boxplots using quality.factor.
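The two views used in the comparisons that follow can also be checked numerically. This is a minimal sketch on made-up toy values, assuming a data frame named `wine`: the slope of the linear model corresponds to the regression line, and the per-score medians correspond to the middle lines of the box plots.

```r
# Toy data (made up); the real dataset has 1599 rows.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  quality = c(5L, 5L, 6L, 6L, 7L, 8L)
)
wine$quality.factor <- factor(wine$quality)

# Regression line: quality modelled as a linear function of alcohol.
fit   <- lm(quality ~ alcohol, data = wine)
slope <- coef(fit)["alcohol"]   # sign gives the direction of the trend

# Box plot view: median alcohol for each quality score.
medians <- tapply(wine$alcohol, wine$quality.factor, median)

slope
medians
```

When the slope is positive and the medians rise from one score to the next, both views agree. A median that moves against the slope, as noted for some variables below, is worth checking against the number of wines in that group.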

quality vs fixed.acidity: No correlation is shown in those plots.

quality vs volatile.acidity: The 2 plots above confirm what we see in the correlation matrix: volatile.acidity has the strongest negative correlation with quality.

quality vs citric.acid: This relationship does not stand out in the correlation matrix but here we can see that quality and citric.acid do correlate with each other.

quality vs residual.sugar: The flat horizontal regression line indicates no correlation here.

quality vs chlorides: There is some correlation here but it’s too weak to be taken into consideration. Also the number of outliers with quality 5 and 6 is very high.

quality vs free.sulfur.dioxide: The box plot is clear that there is no correlation here.

quality vs total.sulfur.dioxide: Same as above. The box plot is clear that there is no correlation here.

quality vs pH: The correlation is too weak. The linear regression line is almost flat, and the box plot medians show a trend but it’s not strong enough.

quality vs density: There is some correlation but it is too weak. The linear regression line indicates a direction, but the box plot medians don’t confirm it from quality=4 to quality=5, where they move in the opposite direction from what the regression line indicates. Since few wines have quality=3 or quality=4 (10 and 53), that deviation carries little weight.

quality vs sulphates: I can see some level of correlation here based on the linear regression line, which is not contradicted by the box plot medians. The number of outliers does stand out, which makes the regression less reliable.

quality vs alcohol: The strongest correlation is here. The linear regression line has a clear trend that confirms the correlation matrix. From quality=3 to quality=4 the box plot medians go against the correlation direction, but so few wines have quality=3 or quality=4 (10 and 53) that this is not a reason to discard the correlation.

Bivariate Analysis

Analyzing the plots above, I noticed that quality does not correlate with most of the other variables: fixed.acidity, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and pH.

The plots of volatile.acidity, citric.acid, density, sulphates and alcohol show that they do correlate with quality, with alcohol being the variable that correlates with quality the most, followed by volatile.acidity, just as the correlation matrix had indicated.

Multivariate Plots Section

quality vs alcohol vs density: I’ve chosen alcohol and density here because they are highly correlated with each other (from the correlation matrix). The plot shows the trend that high quality wines tend to have high alcohol and low density, but the points are widely spread and there is a considerable number of outliers.

quality vs citric.acid vs volatile.acidity: Again the plot shows a trend but it’s inconclusive due to the number of outliers. High quality wines tend to have low volatile acidity and high citric acid.

quality vs alcohol vs volatile.acidity: The two variables that have the strongest correlation with quality are alcohol and volatile.acidity, and thus this plot makes a lot of sense. The trend is clear and there are fewer outliers than in the two other Multivariate Plots above. Clearly wines of high quality tend to have high alcohol and low volatile acidity.
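One way to put numbers on the trend in the last plot is to average both variables within quality bands. This is only a sketch: the three bands and the name `quality.group` are assumptions made for illustration, not something the report defines, and the values are made up.

```r
# Toy data (made up) using the report's column names.
wine <- data.frame(
  quality          = c(3, 4, 5, 5, 6, 6, 7, 8),
  alcohol          = c(9.0, 9.6, 9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  volatile.acidity = c(1.00, 0.90, 0.70, 0.88, 0.60, 0.50, 0.40, 0.30)
)

# Hypothetical banding: scores 3-4 = low, 5-6 = medium, 7-8 = high.
wine$quality.group <- cut(
  wine$quality,
  breaks = c(2, 4, 6, 8),
  labels = c("low", "medium", "high")
)

# Mean alcohol and mean volatile.acidity per band.
group.means <- aggregate(
  cbind(alcohol, volatile.acidity) ~ quality.group,
  data = wine,
  FUN  = mean
)
group.means
```

If the claim holds on the real data, mean alcohol should rise and mean volatile.acidity should fall from the low band to the high band.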

Multivariate Analysis

I decided to draw scatter plots relating quality to pairs of the variables most strongly related to it, in particular volatile.acidity and alcohol.

The plots confirm the correlation numbers, some with denser clouds of points and others more spread out. Some have more outliers than others, of course, but they confirm the correlations listed in the Bivariate Analysis Section and show graphically how the variables relate to each other and to quality.

Final Plots and Summary

Plot One

Description One

The correlation between quality and alcohol is very clear in these plots. The linear regression line shows it graphically, and the box plot confirms it.

Plot Two

Description Two

The negative correlation between quality and volatile.acidity is shown in these plots. The regression line and box plot confirm it.

Plot Three

Description Three

This is the surprise. The correlation between quality and citric.acid didn’t stand out in the correlation matrix, but when you see these plots it’s clear that it exists and is strong enough to be taken into consideration.


Reflection

I’ve found that those 3 variables (alcohol, volatile.acidity and citric.acid) have the strongest correlations with the wine’s quality score. To take this analysis a step further, I would try to create a regression model that uses those 3 variables to predict the wine’s quality.

This project has been of great value as it has challenged me to learn more about correlations, plots, R libraries, ggplot features, histogram distributions and more. With this hands-on experience I feel more confident in exploring other data sets.
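A first attempt at such a model could look like the sketch below. It is an illustration only: the data frame holds made-up toy values, the new wine being scored is hypothetical, and an ordinary linear model is just one possible choice for an ordinal score like quality.

```r
# Toy data (made up); a real fit would use all 1599 wines.
wine <- data.frame(
  quality          = c(3, 4, 5, 5, 6, 6, 7, 8),
  alcohol          = c(9.0, 9.6, 9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  volatile.acidity = c(1.00, 0.90, 0.70, 0.88, 0.60, 0.50, 0.40, 0.30),
  citric.acid      = c(0.00, 0.04, 0.00, 0.10, 0.25, 0.36, 0.30, 0.50)
)

# Multiple linear regression on the three variables named above.
fit <- lm(quality ~ alcohol + volatile.acidity + citric.acid, data = wine)

# Coefficients: the sign of each one shows the direction of its effect
# once the other two variables are held fixed.
coef(fit)

# Predicted quality for a hypothetical new wine.
new.wine  <- data.frame(alcohol = 11, volatile.acidity = 0.5, citric.acid = 0.3)
predicted <- predict(fit, newdata = new.wine)
predicted
```

On the real data, `summary(fit)` would also report R-squared, which says how much of the variation in quality the three variables explain together.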